Colorization as a Proxy Task for Visual Understanding
We investigate and improve self-supervision as a drop-in replacement for
ImageNet pretraining, focusing on automatic colorization as the proxy task.
Self-supervised training has been shown to be more promising for utilizing
unlabeled data than other, traditional unsupervised learning methods. We build
on this success and evaluate the ability of our self-supervised network in
several contexts. On VOC segmentation and classification tasks, we present
results that are state-of-the-art among methods not using ImageNet labels for
pretraining representations.
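As a rough illustration of colorization as a proxy task, the sketch below turns an unlabeled RGB image into a (grayscale input, color target) training pair. Everything specific here is an assumption for illustration and not taken from the abstract: the YUV-style color split, the bin count, and the hypothetical helper name `make_colorization_pair`. The paper's actual color space and loss may differ.

```python
import numpy as np

# Approximate ranges of the U and V chroma channels for RGB values in [0, 1].
U_MAX, V_MAX = 0.436, 0.615

def make_colorization_pair(rgb, n_bins=8):
    """Build a self-supervised training pair from one unlabeled image.

    rgb: float array of shape (H, W, 3) with values in [0, 1].
    Returns (gray, target): gray is the (H, W) network input, and target is
    an (H, W) integer map of quantized color classes in [0, n_bins**2).
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b   # luma (Rec. 601 weights)
    u = 0.492 * (b - gray)                     # blue-difference chroma
    v = 0.877 * (r - gray)                     # red-difference chroma
    # Quantize each chroma channel into n_bins buckets (assumed formulation:
    # color prediction posed as per-pixel classification).
    u_idx = np.clip(((u + U_MAX) / (2 * U_MAX) * n_bins).astype(int), 0, n_bins - 1)
    v_idx = np.clip(((v + V_MAX) / (2 * V_MAX) * n_bins).astype(int), 0, n_bins - 1)
    return gray, u_idx * n_bins + v_idx
```

No human labels are involved: the target is derived from the image itself, which is what makes the task usable on unlabeled data.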
Moreover, we present the first in-depth analysis of self-supervision via
colorization, concluding that formulation of the loss, training details and
network architecture play important roles in its effectiveness. This
investigation is further expanded by revisiting the ImageNet pretraining
paradigm, asking questions such as: How much training data is needed? How many
labels are needed? How much do features change when fine-tuned? We relate these
questions back to self-supervision by showing that colorization provides a
supervisory signal comparably powerful to various flavors of ImageNet
pretraining.
Comment: CVPR 2017 (Project page:
http://people.cs.uchicago.edu/~larsson/color-proxy/)
Regularizing Deep Networks by Modeling and Predicting Label Structure
We construct custom regularization functions for use in supervised training
of deep neural networks. Our technique is applicable when the ground-truth
labels themselves exhibit internal structure; we derive a regularizer by
learning an autoencoder over the set of annotations. Training thereby becomes a
two-phase procedure. The first phase models labels with an autoencoder. The
second phase trains the actual network of interest by attaching an auxiliary
branch that must predict output via a hidden layer of the autoencoder. After
training, we discard this auxiliary branch.
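The two-phase procedure can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: phase 1 swaps the learned autoencoder for a linear one fit in closed form via SVD, the losses are plain squared error, and the names `fit_linear_autoencoder`, `regularized_loss`, and `weight` are hypothetical. It follows one reading of the abstract, in which the auxiliary branch emits a hidden code that the frozen decoder maps back to label space.

```python
import numpy as np

def fit_linear_autoencoder(labels, hidden_dim):
    """Phase 1 (stand-in): model the label set with a linear autoencoder.

    labels: array of shape (n_samples, label_dim).
    Returns (encode, decode) functions; fit in closed form via SVD.
    """
    mean = labels.mean(axis=0)
    _, _, vt = np.linalg.svd(labels - mean, full_matrices=False)
    basis = vt[:hidden_dim]                     # (hidden_dim, label_dim)

    def encode(y):
        return (y - mean) @ basis.T

    def decode(h):
        return h @ basis + mean

    return encode, decode

def regularized_loss(main_pred, aux_code, labels, decode, weight=0.1):
    """Phase 2 objective: task loss plus the auxiliary-branch loss.

    main_pred: output of the network of interest, shape like labels.
    aux_code:  output of the auxiliary branch, in the autoencoder's hidden space.
    The decoder is kept fixed; only the network producing the predictions trains.
    """
    task_loss = np.mean((main_pred - labels) ** 2)
    aux_loss = np.mean((decode(aux_code) - labels) ** 2)
    return task_loss + weight * aux_loss
```

At test time only `main_pred` is computed, so the auxiliary term adds no inference cost.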
We experiment in the context of semantic segmentation, demonstrating that this
regularization strategy leads to consistent accuracy boosts over baselines,
both when training from scratch and in combination with ImageNet pretraining.
Gains are also consistent over different choices of convolutional network
architecture. As our regularizer is discarded after training, our method has
zero cost at test time; the performance improvements are essentially free. We
are simply able to learn better network weights by building an abstract model
of the label space, and then training the network to understand this
abstraction alongside the original task.
Comment: to appear at CVPR 201
Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to
ground spoken language. Robots similarly have access to audio and visual
sensors. Recent work has shown that images and spoken captions can be mapped
into a meaningful common space, allowing images to be retrieved using speech
and vice versa. In this setting of images paired with untranscribed spoken
captions, we consider whether computer vision systems can be used to obtain
textual labels for the speech. Concretely, we use an image-to-words multi-label
visual classifier to tag images with soft textual labels, and then train a
neural network to map from the speech to these soft targets. We show that the
resulting speech system is able to predict which words occur in an
utterance---acting as a spoken bag-of-words classifier---without seeing any
parallel speech and text. We find that the model often confuses semantically
related words, e.g. "man" and "person", making it even more effective as a
semantic keyword spotter.
Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code;
accepted to Interspeech 201
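The training signal described in this abstract, a speech network fit to soft labels produced by a visual tagger, might look roughly like the sketch below. The loss is an assumption: the abstract does not say which objective is used, so per-word binary cross-entropy against the soft targets is chosen here for illustration. The names `soft_target_bce` and `predict_keywords` and the 0.5 threshold are likewise hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_target_bce(logits, soft_targets, eps=1e-7):
    """Binary cross-entropy between speech-network outputs and soft visual tags.

    logits:       raw per-word scores from the speech network.
    soft_targets: per-word probabilities in [0, 1] from the image tagger.
    """
    probs = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return -np.mean(soft_targets * np.log(probs)
                    + (1.0 - soft_targets) * np.log(1.0 - probs))

def predict_keywords(logits, vocab, threshold=0.5):
    """Bag-of-words prediction: words whose probability clears the threshold."""
    probs = sigmoid(logits)
    return [word for word, p in zip(vocab, probs) if p >= threshold]
```

Because the targets come from the image side, no transcription of the speech is needed during training.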